t-Plausibility: Generalizing Words to Desensitize Text
Authors
Abstract
De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in the anonymization of structured data, the anonymization of textual information is in its infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information (names, addresses, dates), they fall short on two counts. First, complete redaction may not be necessary to prevent re-identification, and it reduces the readability and usability of the text. Second, and more serious, identifying information, as well as sensitive information, can be quite subtle and may still be present in the text even after obvious identifiers are removed. For example, the diagnosis “tuberculosis” is sensitive, but in some situations it can also be identifying. Replacing it with the less sensitive term “infectious disease” also reduces identifiability. That is, instead of simply being removed, sensitive terms can be replaced by more general but semantically related terms, protecting sensitive and identifying information without unnecessarily reducing the amount of information contained in the document. Based on this observation, the main contribution of this paper is a novel information-theoretic approach to text sanitization, together with efficient heuristics for sanitizing text documents.
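The idea of generalizing a term rather than deleting it can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract gives neither the information-theoretic measure nor the heuristics, so the toy hierarchy `HYPERNYMS`, the set `SENSITIVE`, and the helpers `generalize` and `sanitize` are all hypothetical names introduced only for illustration.

```python
# Hypothetical toy hierarchy: each term maps to a more general term.
# A real system would presumably draw this from an ontology or thesaurus.
HYPERNYMS = {
    "tuberculosis": "infectious disease",
    "infectious disease": "disease",
    "disease": "condition",
}

# Hypothetical set of terms considered sensitive in this example.
SENSITIVE = {"tuberculosis"}


def generalize(term, levels=1):
    """Walk up the toy hierarchy by at most `levels` steps."""
    for _ in range(levels):
        if term not in HYPERNYMS:
            break  # no more general term is known; stop here
        term = HYPERNYMS[term]
    return term


def sanitize(tokens, sensitive, levels=1):
    """Replace each sensitive token with a more general term; keep the rest."""
    return [generalize(t, levels) if t in sensitive else t for t in tokens]


if __name__ == "__main__":
    text = ["patient", "diagnosed", "with", "tuberculosis"]
    print(" ".join(sanitize(text, SENSITIVE)))
```

Raising `levels` trades information content for protection: one step keeps the fact that the condition is infectious, while two steps keep only that it is a disease.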
Journal: Trans. Data Privacy
Volume: 5
Publication date: 2012